Abstract

Motivated by applications in high-dimensional data analysis where strong 
signals often stand out easily and weak ones may be indistinguishable from 
the noise, we develop a statistical framework to provide a novel categorization 
of the data into the signal, noise, and indistinguishable subsets. The three- 
subset categorization is especially relevant under high-dimensionality as a large 
proportion of signals can be obscured by the large amount of noise. Under- 
standing the three-subset phenomenon is important for the researchers in real 
applications to design efficient follow-up studies. We develop an efficient data- 
driven procedure to identify the three subsets. Theoretical study shows that, 
under certain conditions, only signals are included in the identified signal subset
while the remaining signals are included in the identified indistinguishable
subset with high probability. Moreover, the proposed procedure adapts to
the unknown signal intensity, so that the identified indistinguishable subset 
shrinks with the true indistinguishable subset when signals become stronger. 
The procedure is examined and compared with methods based on FDR control 
using Monte Carlo simulation. Further, it is applied successfully in a real-data 
application to identify genomic variants having different signal intensity.

1 Introduction



The problem of identifying a small number of signals from a large amount of noise is a 
central topic in modern statistics due to motivations from a wide spectrum of emerging 
applications. Examples include the detection of astrophysical sources, surveillance for 
disease outbreaks, identification of causal genetic markers, etc. In real applications, 
it is frequently observed that strong signals can easily stand out, while weak ones are 
often mixed indistinguishably with the noise. This phenomenon is especially relevant 
under high-dimensionality as a large proportion of signals can be obscured by the 
large amount of noise.

In this paper, we aim to extract valuable information from the data by categorizing 
the data into the signal, noise, and indistinguishable subsets. More specifically, we 
want to identify the signal subset in the data which includes only true signals, the 
noise subset which includes only noise, and the indistinguishable subset, where signals 
and noise cannot be separated. To formulate the problem rigorously, let $S_0$ be the
collection of noise in the data, and $S_1$ the collection of true signals. The p-values of
the data satisfy

    $P_i \sim U\,1\{i \in S_0\} + G\,1\{i \in S_1\}, \quad i \in \{1, \ldots, n\},$    (1)

where $U$ is the uniform distribution on $[0, 1]$ and $G$ is some unknown continuous
distribution with $G(t) > U(t)$ for all $t \in (0, 1)$. The p-values are ordered as
$P_{(1)} < P_{(2)} < \cdots < P_{(n)}$. Define $d_*$ as the separation point between the signal
and indistinguishable subsets, and $d_{**}$ the separation point between the
indistinguishable and noise subsets, i.e. $d_* = \min\{i : P_{(i)} \text{ from a noise}\} - 1$
and $d_{**} = \max\{i : P_{(i)} \text{ from a signal}\}$. Our goal is to identify the three
subsets by estimating $d_*$ and $d_{**}$.
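
As a concrete reading of these two definitions, the following Python sketch computes the
separation points when the true labels are known, as they would be in a simulation. The
function name `separation_points`, the boolean label encoding, and the conventions for
the degenerate cases (no noise or no signal) are illustrative assumptions, not part of
the paper.

```python
def separation_points(pvalues, is_signal):
    """Return (d_star, d_2star) for labelled p-values.

    d_star  = (rank of the first noise p-value) - 1
    d_2star = rank of the last signal p-value
    Ranks are 1-based positions in the sorted p-value sequence.
    """
    order = sorted(range(len(pvalues)), key=lambda i: pvalues[i])
    labels = [is_signal[i] for i in order]  # labels in rank order
    noise_ranks = [r for r, s in enumerate(labels, start=1) if not s]
    signal_ranks = [r for r, s in enumerate(labels, start=1) if s]
    # Assumed conventions: d_star = n if there is no noise, d_2star = 0 if no signal.
    d_star = (noise_ranks[0] - 1) if noise_ranks else len(labels)
    d_2star = signal_ranks[-1] if signal_ranks else 0
    return d_star, d_2star


# Toy example: ranks 1-2 are signals, rank 3 is noise, rank 4 is a signal.
p = [0.001, 0.004, 0.020, 0.030, 0.400, 0.900]
s = [True, True, False, True, False, False]
print(separation_points(p, s))  # (2, 4)
```

In this toy example ranks 1 to 2 form the signal subset, ranks 3 to 4 the
indistinguishable subset, and ranks 5 to 6 the noise subset.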

Understanding the three-subset phenomenon can be important for the researchers 
in real applications to design appropriate follow-up studies and allocate their resources 
more efficiently. For instance, candidates in the signal subset may have priority 
for more focused study, while those in the noise subset can be removed; and, for
candidates in the indistinguishable subset, additional data may be collected to further
separate weak signals from the noise (Conneely and Boehnke (2010), Spencer et al. (2009),
Suresh and Chandrashekara (2012), etc.).

The proposed framework of three-subset categorization helps to enrich current 
studies in multiple testing, which largely focus on the dichotomy of rejecting versus 
not rejecting null hypotheses. By controlling false positives, multiple testing
procedures identify strong signals with high confidence. Popular criteria for false
positive control include family-wise error rate (FWER) control (Dudoit et al. (2003),
Dudoit et al. (2004), etc.) and false discovery rate (FDR) control (Benjamini and
Hochberg (1995, 2000)). Recent developments in multiple testing focus on improving the
power of FDR procedures and controlling FDR under dependence (Storey et al. (2004),
Genovese and Wasserman (2004), Abramovich et al. (2006), Sun and Cai (2007), Efron (2007),
Fan et al. (2012), etc.). These studies, however, would not provide informative results
for the weak signals that are indistinguishable from the noise, as these signals cannot
be separated by controlling the selection of the noise alone. The higher the
dimensionality is, the more signals become indistinguishable, and the less efficient the
criterion of false positive control could be. This limitation can hinder meaningful
applications of multiple testing procedures in ultra-high dimensional data analysis.

To delineate the indistinguishable and noise subsets would require an adaptive 
bound for the range of the weak signals. As the signals are often very sparse compared 
to the amount of noise, it is a challenging task to provide a statistical framework 
to characterize the weak signals. For instance, power analysis in multiple testing 
is well known to be difficult due to the limited information about the true signals.
Another example is in variable selection, where screening procedures are developed to
identify and then remove the noise subset (Fan and Lv (2008), Fan et al. (2009), Hall and
Miller (2009), Fan et al. (2011), Zhu et al. (2011), Li et al. (2012), etc.). While
significant efficiency has been demonstrated for these methods in handling ultra-high
dimensional data, setting a good screening parameter remains a difficult problem as
it depends on the proportion and intensity of the non-zero coefficients, which are
hard to infer from the data. Because of the inherent difficulty of weak signal
inference, even though the phenomenon of three subsets has been frequently observed
(e.g. Drton and Perlman (2008)), no rigorous statistical studies have been developed
to explore the properties of the three subsets, nor is an efficient categorization
method available to date.

In this paper, we demonstrate the existence of the signal, noise, and indistinguishable
subsets in Section 2 and connect the results with some recent developments in
exact signal recovery. An efficient data-driven procedure called Two-Level Thresholding
(TLT) is proposed in Section 3 to identify the three subsets by estimating the
separation points $d_*$ and $d_{**}$. $d_*$ is estimated by the first level threshold
$\hat d_*$, which strongly controls false positives and only selects strong signals with
high probability. The more challenging part is the construction of $\hat d_{**}$, the second
level threshold for the separation point between the indistinguishable and noise subsets.
We develop a data-driven step-down procedure that traverses the ordered p-values until
all signals are likely to be included. We show that, under certain conditions, only
signals are included in the identified signal subset while the remaining signals are
included in the identified indistinguishable subset with high probability.

Besides controlling false positives and false negatives, the proposed TLT procedure 
adapts to the intensity of the signal, so that the two thresholding levels move closer 
to each other as signals become stronger and the indistinguishable subset reduces in 
size. In the case when all signals are strong enough to be well-separated from the 
noise, the two thresholding levels converge to a single point. 

The construction of TLT is completely data-driven. No prior information of the 
data distribution is needed; neither are tuning parameters involved in the algorithm. 
The computation is very fast with complexity $O(n \log n)$. These properties meet the
needs of high-dimensional data applications.

The rest of the paper is organized as follows. We first demonstrate the existence
of the three subsets in Section 2. Then we introduce the construction of the TLT
procedure with its theoretical properties for the identification of the three subsets
in Section 3. Monte Carlo simulations are presented in Section 4 to compare the
results of TLT with those of the methods based on FDR control. Real-data results
are provided in Section 5, where we apply our procedure to analyze SNP array data.
We conclude in Section 6 with further discussions. The proofs are relegated to the
Appendix.

2 Existence of The Three Subsets 

In this section we first present the sufficient and almost necessary conditions for the 
existence of the signal, noise, and indistinguishable subsets. The results are connected 
to the recent developments in exact signal recovery. A simulation example is shown 
to demonstrate the relationship between the sizes of the three subsets and the signal 
intensity. To allow a succinct theoretical study, we assume, in this section, that the 
observations are generated independently from a normal mixture, i.e., 

    $X_i \sim N(0, 1)\,1\{i \in S_0\} + N(\mu, 1)\,1\{i \in S_1\}, \quad i \in \{1, \ldots, n\}.$    (2)

The following theorem shows the sufficient and almost necessary conditions for the
existence of the three subsets.

Theorem 2.1 Assume model (2). Then, asymptotically, the sufficient and almost
necessary condition for the existence of the signal subset is

    $\mu > \sqrt{2(1 + \epsilon) \log |S_0|} - \sqrt{2 \log |S_1|},$    (3)

for the existence of the indistinguishable subset is

    $\mu < \sqrt{2(1 - \epsilon) \log |S_0|} + \sqrt{2 \log |S_1|},$    (4)

and for the existence of the noise subset is

    $\log |S_1| < (1 - \epsilon) \log |S_0|,$    (5)

for any $\epsilon > 0$.

Theorem 2.1 implies that (a) all three subsets exist when signals are sparse
($|S_1| = o(n)$) and the signal intensity is between the two bounds in (3) and (4); (b) when
signal intensity is too small ($\mu < \sqrt{2(1 + \epsilon) \log |S_0|} - \sqrt{2 \log |S_1|}$),
no signals stand outside the range of the noise, and only the indistinguishable and noise
subsets exist; and (c) when signal intensity is large enough
($\mu > \sqrt{2(1 - \epsilon) \log |S_0|} + \sqrt{2 \log |S_1|}$), all signals are excluded
from the range of the noise, and only the signal and noise subsets exist. Moreover, (4)
shows that the higher the dimensionality is, the more likely that the indistinguishable
subset exists.
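
To make the three conditions of Theorem 2.1 tangible, the sketch below evaluates them
for given values of the signal mean, the number of noise points, and the number of
signals. The function name `subset_existence` and the default `eps=0.01` are
illustrative assumptions. The theorem is asymptotic, so plugging in finite sample sizes
gives only a rough guide, especially near the boundaries.

```python
import math


def subset_existence(mu, n_noise, n_signal, eps=0.01):
    """Evaluate the existence conditions for the signal, indistinguishable and noise subsets.

    mu       : signal mean in the normal mixture model
    n_noise  : |S_0|, number of noise observations
    n_signal : |S_1|, number of signals (must be >= 1)
    eps      : the small positive constant of the theorem (hypothetical default)
    """
    log0, log1 = math.log(n_noise), math.log(n_signal)
    lower = math.sqrt(2 * (1 + eps) * log0) - math.sqrt(2 * log1)
    upper = math.sqrt(2 * (1 - eps) * log0) + math.sqrt(2 * log1)
    return {
        "signal": mu > lower,             # signal subset exists
        "indistinguishable": mu < upper,  # indistinguishable subset exists
        "noise": log1 < (1 - eps) * log0, # noise subset exists
    }


for mu in (0.5, 4.0, 8.0):
    print(mu, subset_existence(mu, n_noise=9800, n_signal=200))
```

With 9,800 noise points and 200 signals, the two bounds are roughly 1.05 and 7.52, so a
mean of 4 satisfies all three conditions, a mean of 0.5 fails the signal-subset
condition, and a mean of 8 fails the indistinguishable-subset condition.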



2.1 Connection to Exact Signal Recovery 



It is interesting to note that the sufficient and almost necessary condition for the
existence of the indistinguishable subset is closely related to the condition for exact
signal recovery in Ji and Jin (2012) and in Xie et al. (2011). Adopting similar
calibrations,

    $\pi = |S_1|/n = n^{-\beta}, \quad 0 < \beta < 1, \quad \text{and} \quad \mu = \mu_n = \sqrt{2 r \log n}, \quad r > 0,$    (6)

we have the following result.

Corollary 2.1 Assume model (2) with calibration (6). Then, asymptotically, the
noise subset always exists, and the sufficient and almost necessary condition for the
existence of the signal subset is

    $r > (1 - \sqrt{1 - \beta})^2,$    (7)

and for the indistinguishable subset is

    $r < (1 + \sqrt{1 - \beta})^2.$    (8)

Note that condition (8) delineates the complementary set of the exact recovery region
in Ji and Jin (2012). In other words, only when the indistinguishable subset does not
exist is it possible to recover all signals with probability approximately 1. It is also
interesting to see that condition (7) coincides with the detection boundary for the
maximum statistic $M_n = \max_{1 \le i \le n} \{X_i\}$ (Donoho and Jin (2004)). This shows that
only when the signal subset exists is it possible for the maximum statistic $M_n$ to
separate the hypotheses
$H_0: X_i \sim N(0, 1)$, $1 \le i \le n$, and
$H_1: X_i \sim N(0, 1)\,1\{i \in S_0\} + N(\mu, 1)\,1\{i \in S_1\}$, $1 \le i \le n$.
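
The step from Theorem 2.1 to the corollary can be sketched heuristically by substituting
the calibration into the three conditions of the theorem. The sketch treats $\epsilon$ as
negligible and uses $|S_0| \approx n$; it is meant as a reading aid under those
assumptions, not as a substitute for the formal argument.

```latex
\begin{align*}
\log|S_1| &= (1-\beta)\log n, \qquad
\log|S_0| = \log\bigl(n - n^{1-\beta}\bigr) \approx \log n, \\
\text{signal subset:}\quad
\sqrt{2r\log n} &> \sqrt{2\log n} - \sqrt{2(1-\beta)\log n}
  \iff r > \bigl(1-\sqrt{1-\beta}\bigr)^2, \\
\text{indistinguishable subset:}\quad
\sqrt{2r\log n} &< \sqrt{2\log n} + \sqrt{2(1-\beta)\log n}
  \iff r < \bigl(1+\sqrt{1-\beta}\bigr)^2, \\
\text{noise subset:}\quad
(1-\beta)\log n &< \log n \quad \text{for every } \beta \in (0,1).
\end{align*}
```

Squaring is valid in the first equivalence because $1 - \sqrt{1-\beta} > 0$ for
$\beta \in (0, 1)$, and the last line is why the noise subset always exists under this
calibration.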

2.2 A Simulation Example 

The simulation example in this section demonstrates the relationship between the
signal intensity and the sizes of the three subsets. The performance of the proposed
TLT procedure is also presented in this example. We generate 10,000 observations
and calculate their p-values, among which 2% are from $N(\mu, 1)$ and the rest from
$N(0, 1)$. We set $\mu$ at 3, 4, and 7.5. When $\mu = 3$, $(d_*, d_{**}) = (65, 3090)$; when
$\mu = 4$, $(d_*, d_{**}) = (116, 928)$; and when $\mu = 7.5$, $(d_*, d_{**}) = (200, 200)$. The
three subsets and $(d_*, d_{**})$ are delineated in Figure 1 (a) in log-scale for better
view. It is clear that, as $\mu$ increases, the signal subset increases to include all true
signals, and the indistinguishable subset decreases to an empty set.

For the above example with $\mu = 4$ and $(d_*, d_{**}) = (116, 928)$, the distribution
of the ranks of the signals is presented in Figure 1 (b). Our estimates are
$(\hat d_*, \hat d_{**}) = (72, 357)$, and clearly $\hat d_* < d_*$, so that
$P_{(1)}, \ldots, P_{(\hat d_*)}$ are all from signals. $\hat d_{**} = 357$, however, is much
smaller than $d_{**} = 928$, but $P_{(1)}, \ldots, P_{(\hat d_{**})}$ include 197 out
of 200 signals, suggesting it as a reasonable estimate for the separation between
the indistinguishable and noise regions. For comparison, the cut-off point of the
FDR procedure in Benjamini and Hochberg (1995) (BH-FDR) with the control level
conventionally set at 0.05 is 172, which means BH-FDR selects $P_{(1)}, \ldots, P_{(172)}$ from
the ordered p-values. The cut-off point of BH-FDR is between $\hat d_*$ and $\hat d_{**}$, and
larger than $d_*$. Apparently, BH-FDR selects more signals than $\hat d_*$ and a few noise, but
still misses many of the signals.

(a) Three Subsets    (b) Distribution of Signal Ranks

Figure 1: (a) Three subsets on the rank sequence of ordered p-values in log-scale.
Signal subset (solid line), indistinguishable subset (dash-and-dot line), and noise subset
(dotted line) are separated by $d_*$ and $d_{**}$. (b) Distribution of the signal ranks when
$\mu = 4$. $(d_*, d_{**})$ are indicated at (116, 928). Vertical lines at 72 and 357 represent
the locations of $(\hat d_*, \hat d_{**})$. The vertical line at 172 represents the location of
the BH-FDR threshold.
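
For readers who want to reproduce the BH-FDR cut-off used as the comparison point in
this example, a minimal sketch of the standard Benjamini-Hochberg step-up rule is given
here. The function name `bh_cutoff` and the toy p-values are illustrative; the numbers
reported in the text come from the authors' simulation, not from this snippet.

```python
def bh_cutoff(pvalues, alpha=0.05):
    """Benjamini-Hochberg step-up cut-off: the largest rank k with p_(k) <= k * alpha / n.

    Returns 0 when no ordered p-value passes its threshold.
    """
    p_sorted = sorted(pvalues)
    n = len(p_sorted)
    cutoff = 0
    for k, p in enumerate(p_sorted, start=1):
        if p <= k * alpha / n:
            cutoff = k  # keep the largest rank that passes
    return cutoff


p = [0.001, 0.008, 0.039, 0.041, 0.042, 0.060, 0.074, 0.205, 0.212, 0.216]
print(bh_cutoff(p))  # 2
```

Because the rule is step-up, a later rank that passes its threshold pulls in all earlier
ranks, even those that individually failed.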

3 Identification of The Three Subsets 



In this section, we first construct the TLT procedure to estimate the separation 
points between the signal and indistinguishable subsets and between the indistin- 
guishable and noise subsets, respectively. Similar to other adaptive procedures in 
multiple testing, we start with an estimate of the signal proportion
$\pi = |S_1| / |S_0 \cup S_1|$. Various estimators have been developed in the literature
under certain conditions on the data distribution. For example, Genovese and
Wasserman (2004) and Meinshausen and Rice (2006) proposed two proportion estimators under
a "purity" condition on the signal p-values. Cai et al. (2007), Jin and Cai (2007), and
Jin (2008) developed proportion estimators for normally distributed observations. Given
an estimate $\hat\pi$ for the signal proportion, our estimator for the separation between
the signal and indistinguishable subsets is defined as

    $\hat d_* = \max\{i : p_{(i)} < \alpha_n / ((1 - \hat\pi) n)\},$    (9)

where $\alpha_n$ is the tolerance level for false positives and $\alpha_n \to 0$ as
$n \to \infty$. The choice of the convergence speed of $\alpha_n$ depends on how stringently
one wants to control the family-wise type I error. A reasonable choice is
$\alpha_n = 1/\log n$. $\hat d_*$ can be regarded as an adaptive Bonferroni threshold. Its
property of controlling false positives is relatively straightforward. The more
challenging part is the construction of $\hat d_{**}$, the estimate of the separation between
the indistinguishable and noise subsets. Even with the help of an estimate for signal
proportion, one still does not know where the separation is since the signals are mixed
with noise in the indistinguishable subset. Simply cutting at $\hat\pi n$ can include a lot
of noise and miss many signals. We propose a data-driven procedure that traverses the
ordered p-values until all signals are likely to be included. This cut is defined as

    $\hat d_{**} = \hat d_*$  if $\hat\pi n \le \hat d_*$,
    $\hat d_{**} = \hat\pi n + \min\{j \ge 1 : p_{(\hat\pi n + j)} \le F_{(j)}^{-1}(\beta_n)\}$  otherwise,    (10)

where $F_{(j)}^{-1}$ is the inverse cumulative distribution function of
Beta$(j, (1 - \hat\pi) n - j + 1)$, $\beta_n$ is the tolerance level for false negatives, and
$\beta_n \to 0$ as $n \to \infty$. A reasonable choice is $\beta_n = 1/\log n$. It is easy to
see that $\hat d_{**}$ is always greater than or equal to $\hat d_*$. In the case when
$\hat\pi n \le \hat d_*$, all signals are likely to rank before $\hat d_*$, so there is no need
to go further along the ordered p-values, and we set $\hat d_{**} = \hat d_*$. On the other
hand, $\hat\pi n > \hat d_*$ means that some signals are missing in the first $\hat d_*$ ordered
p-values, so that we need to go further to find all the signals. The search for
$\hat d_{**}$ starts at $\hat\pi n$, which is the estimated number of signals, and ends at the
smallest $j$ where $p_{(\hat\pi n + j)}$ is no greater than the $\beta_n$-quantile of
Beta$(j, (1 - \hat\pi) n - j + 1)$, which is the distribution of the $j$-th ordered p-value
from $(1 - \hat\pi) n$ noise. The intuition here is as follows. Suppose that not all signals
rank before $\hat d_{**}$; then the number of noise in $\{p_{(1)}, \ldots, p_{(\hat d_{**})}\}$ is
likely to be greater than $\hat d_{**} - \hat\pi n$. Denote $j = \hat d_{**} - \hat\pi n$; then the
$j$-th ordered p-value from $(1 - \pi) n$ noise is smaller than $p_{(\hat d_{**})}$. This event,
however, has a small probability $\beta_n$ due to the construction of $\hat d_{**}$, where
$p_{(\hat d_{**})} \le F_{(j)}^{-1}(\beta_n)$.
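
The two thresholding levels described above can be sketched in a few lines of Python.
This is a minimal illustration under stated assumptions, not the authors'
implementation: the names `beta_cdf_via_binomial` and `tlt_thresholds` are hypothetical,
the counts of estimated signals and noise are rounded to integers, the search falls back
to $n$ if no rank satisfies the stopping rule, and the Beta quantile comparison is done
through the equivalent check "Beta CDF at the p-value is at most $\beta_n$", using the
identity between the CDF of the $j$-th uniform order statistic and a binomial tail.

```python
import math


def beta_cdf_via_binomial(p, j, m):
    """CDF at p of Beta(j, m - j + 1), the law of the j-th smallest of m uniform p-values.

    Uses the identity P(U_(j) <= p) = P(Binomial(m, p) >= j).
    """
    if j <= 0 or p >= 1.0:
        return 1.0
    if j > m or p <= 0.0:
        return 0.0
    log_p, log_q = math.log(p), math.log1p(-p)
    lower = 0.0  # P(Binomial(m, p) <= j - 1), each term computed in log space
    for k in range(j):
        lower += math.exp(
            math.lgamma(m + 1) - math.lgamma(k + 1) - math.lgamma(m - k + 1)
            + k * log_p + (m - k) * log_q
        )
    return max(0.0, 1.0 - lower)


def tlt_thresholds(pvalues, pi_hat, alpha_n=None, beta_n=None):
    """Return (first-level, second-level) thresholds for a proportion estimate pi_hat < 1."""
    p = sorted(pvalues)
    n = len(p)
    alpha_n = 1.0 / math.log(n) if alpha_n is None else alpha_n
    beta_n = 1.0 / math.log(n) if beta_n is None else beta_n
    n_noise = max(1, round((1.0 - pi_hat) * n))  # rounding is an assumption
    n_sig = round(pi_hat * n)                    # rounding is an assumption

    # First level: adaptive Bonferroni cut on the sorted p-values.
    bonf = alpha_n / ((1.0 - pi_hat) * n)
    d1 = sum(1 for x in p if x < bonf)

    # Second level: nothing more to do if the estimated signal count is already covered.
    if n_sig <= d1:
        return d1, d1
    for j in range(1, n - n_sig + 1):
        # p <= beta_n-quantile of Beta(j, n_noise - j + 1)  <=>  Beta CDF at p <= beta_n
        if beta_cdf_via_binomial(p[n_sig + j - 1], j, n_noise) <= beta_n:
            return d1, n_sig + j
    return d1, n  # fallback when the stopping rule is never met (assumption)


toy = [0.001, 0.002, 0.02, 0.03, 0.5, 0.6, 0.7, 0.8, 0.9, 0.95]
print(tlt_thresholds(toy, pi_hat=0.2, alpha_n=0.01, beta_n=0.1))  # (1, 4)
```

The binomial-tail sum costs $O(j)$ per step, which keeps the sketch dependency-free but
is slower than a library Beta quantile; for large data a numerical library would be the
natural choice.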

Next, we present theoretical results on the properties of the two thresholding levels
$\hat d_*$ and $\hat d_{**}$. For simplicity, we utilize the proportion estimator of
Meinshausen and Rice (2006), which is also constructed based on p-values. The estimator,
defined as

    $\hat\pi = \max_{1 \le i \le n/2} \frac{i/n - p_{(i)} - \sqrt{2 \log\log n / n}\,\sqrt{p_{(i)}(1 - p_{(i)})}}{1 - p_{(i)}},$    (11)

is plugged into (9) and (10). Other proportion estimators can be used in the
constructions of $\hat d_*$ and $\hat d_{**}$ in a similar way. The $\hat\pi$ in (11) is a
consistent estimator under the following conditions as presented in Theorems 2 and 3 in
Meinshausen and Rice (2006). Let $\pi = n^{-\zeta}$ for some $\zeta \in [0, 1)$. Assume either

    $\zeta \in [0, 1/2)$ and $\inf_{t \in (0,1)} G'(t) = 0,$    (12)

or

    $\zeta \in [1/2, 1)$ and, for all $q \in (0, 1)$, $\lim_{n \to \infty} (\log G^{-1}(q)) / (\log n) = -r$, $r > 2(\zeta - 1/2)$.    (13)

Condition (12) considers relatively dense signals with $\pi n > \sqrt{n}$, and all we need is
the "purity" condition $\inf_{t \in (0,1)} G'(t) = 0$ (Genovese and Wasserman (2004),
Meinshausen and Rice (2006)). Condition (13) considers sparse signals with
$\pi n \le \sqrt{n}$. In this case, a stronger condition is needed for signal intensity,
which is implied by (13).
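
A direct transcription of the proportion estimator into Python may help when
experimenting with the procedure. The function name `mr_proportion` is hypothetical, and
flooring the result at zero is an added assumption (the maximum of the raw expression
can be negative when there is no detectable signal). The sketch requires at least three
p-values so that $\log\log n$ is positive.

```python
import math


def mr_proportion(pvalues):
    """Signal-proportion estimate of the Meinshausen-Rice form, maximised over i <= n/2."""
    p = sorted(pvalues)
    n = len(p)
    c = math.sqrt(2.0 * math.log(math.log(n)) / n)  # sqrt(2 log log n / n)
    best = 0.0  # floor at zero (assumption)
    for i in range(1, n // 2 + 1):
        pi_ = p[i - 1]
        if pi_ >= 1.0:
            continue  # avoid division by zero
        value = (i / n - pi_ - c * math.sqrt(pi_ * (1.0 - pi_))) / (1.0 - pi_)
        best = max(best, value)
    return best


# 20 very small "signal" p-values mixed with 80 evenly spread "noise" p-values.
signals = [k * 1e-6 for k in range(1, 21)]
noise = [(k - 0.5) / 80 for k in range(1, 81)]
print(mr_proportion(signals + noise))
```

In this toy input the estimate comes out slightly below the true proportion 0.2, which is
in line with the conservative, lower-bound flavour of this type of estimator.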

Now we show that, with high probability, only signals are ranked before $\hat d_*$ and
the number of signals ranked before $\hat d_{**}$ converges to $|S_1|$, the total number of
signals. Let

    $n_0(d)$ = number of noise in $\{p_{(1)}, \ldots, p_{(d)}\}$ and $n_1(d)$ = number of signals in $\{p_{(1)}, \ldots, p_{(d)}\}$

for any integer $d \ge 1$.

Theorem 3.1 Assume model (1) and condition (12) or (13). Then, with high probability,
only signals are ranked before the first thresholding level $\hat d_*$, and the number of
signals ranked before the second thresholding level $\hat d_{**}$ converges to the total
number of signals. That is, as $n \to \infty$,

    $P(n_0(\hat d_*) > 0) \to 0$    (14)

and

    $P(n_1(\hat d_{**}) / |S_1| < 1 - \epsilon) \to 0$    (15)

for any $\epsilon > 0$.

Theorem 3.1 shows that $\hat d_*$ and $\hat d_{**}$ are conservative estimates, which control
false positives and false negatives respectively. While one can always achieve
conservative estimates at 0 and $n$, the proposed estimators move closer to each other as
signals become stronger and the indistinguishable subset gets smaller. When all signals
are strong enough to be well-separated from the noise, $\hat d_*$ and $\hat d_{**}$ converge to
a single point. This adaptivity property of the TLT procedure is presented in the
following theorem with $\bar G = 1 - G$ defined as the survival function of $G$.

Theorem 3.2 Assume model (1). If signals are strong enough, such that
$\pi n \bar G(n^{-r}) \to 0$ for some $r > 1$, then, with high probability, the
indistinguishable subset does not exist, and for any $\alpha_n$ satisfying
$-\log n \ll \log \alpha_n \ll 0$, the signal and noise subsets are consistently separated by
$\hat d_* = \hat d_{**}$. That is,

    $P(\hat d_* = \hat d_{**} = |S_1|) \to 1$    (16)

as $n \to \infty$.

An intuitive understanding for the condition $\pi n \bar G(n^{-r}) \to 0$, $r > 1$, is that
$\bar G(n^{-r}) \ll 1/(\pi n) = 1/n^{1-\zeta} = o(1)$ with $\pi = n^{-\zeta}$, which means that
the total mass of $G$ is asymptotically between 0 and $n^{-r}$. Note that the expectation of
the smallest value from $n$ noise is of order $n^{-1}$. Therefore, with $r > 1$, all the
p-values of signals are well-separated from all the p-values of noise.

Theorems 3.1 and 3.2 are developed for $\hat d_*$ and $\hat d_{**}$ with $\hat\pi$ defined as in
(11). If other proportion estimators are used, conditions in the theorems will be changed
accordingly. For example, the proportion estimator in Cai et al. (2007) is designed
for normally distributed noise and signals. Utilizing the additional properties of
the distribution, this estimator is consistent under a weaker condition on the signal
intensity in the sparse scenario compared to (13) (Cai et al. (2007)). The theoretical
properties of $\hat d_*$ and $\hat d_{**}$ in identifying the signal, noise, and
indistinguishable subsets can be proved in a similar way.

In real applications, data may not satisfy the conditions for the existence of a con- 
sistent proportion estimator. However, prior knowledge can often allow practitioners 
to provide a possible range for the signal proportion. We demonstrate that the study 
of signal, noise, and indistinguishable subsets can still be carried out utilizing such 
prior knowledge. Suppose $\pi$ is bounded by

    $\pi^- \le \pi \le \pi^+.$    (17)

Define

    $\tilde d_* = \max\{i : p_{(i)} < \alpha_n / ((1 - \pi^-) n)\}$    (18)

and

    $\tilde d_{**} = \tilde d_*$  if $\pi^+ n \le \tilde d_*$,
    $\tilde d_{**} = \pi^+ n + \min\{j \ge 1 : p_{(\pi^+ n + j)} \le \tilde F_{(j)}^{-1}(\beta_n)\}$  otherwise,    (19)

where $\tilde F_{(j)}^{-1}$ is the inverse cumulative distribution function for
Beta$(j, (1 - \pi^-) n - j + 1)$. The next theorem states that the modified versions
$\tilde d_*$ and $\tilde d_{**}$ can still serve as conservative estimates for the separation
points $d_*$ and $d_{**}$.

Theorem 3.3 Assume model (1) and condition (17). Then, with high probability,
only signals are ranked before $\tilde d_*$ and the number of signals ranked before
$\tilde d_{**}$ converges to the total number of signals. That is, as $n \to \infty$,

    $P(n_0(\tilde d_*) > 0) \to 0$    (20)

and

    $P(n_1(\tilde d_{**}) < |S_1|) \to 0.$    (21)

Although $(\tilde d_*, \tilde d_{**})$ may not be as close to $(d_*, d_{**})$ as
$(\hat d_*, \hat d_{**})$ is, they can provide useful information on the signal, noise, and
indistinguishable subsets in many applications where conditions for the consistency of
proportion estimation are hard to satisfy and some informative prior knowledge of the
signal proportion is available.



4 Simulation 

In this section, we demonstrate, via simulation studies, the finite sample performance 
of the TLT procedure on the identification of the signal, noise, and indistinguishable
subsets. In each example, 10,000 observations are generated, in which the
noise data points are sampled from $N(0, \sigma)$ and signals from $N(\mu, 1)$. The
selections of $\hat d_*$ and $\hat d_{**}$ with $\alpha_n = \beta_n = 1/(2 \log n) \approx 0.05$ are
compared with those of the BH-FDR with $\alpha = 0.05$ (Benjamini and Hochberg (1995)) and
the adaptive FDR (Benjamini and Hochberg (2000), Genovese and Wasserman (2004)). Setting
$\alpha_n$ at $1/(2 \log n)$ for $n = 10{,}000$ results in a control level close to the
conventional level (0.05) used by other methods, so that the results from different
methods are comparable. $\beta_n$ is set to be equal to $\alpha_n$ for simplicity.

Among the methods compared, BH-FDR is easiest to implement, while the others
require estimating the signal proportion. The estimates $\hat d_*$ and $\hat d_{**}$, the cut-off
point of BH-FDR ($t_{FDR}$), the cut-off point of the adaptive FDR ($t_{aFDR}$), as well as the
number of false positives (FP) and the number of false negatives (FN) for each procedure
are computed. We repeatedly generate the observations and compute performance
measures for 100 times in each simulation example. The median and mean absolute
deviation (MAD) of these measures are reported for more robust comparison results
against the outliers in the 100 replications.
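
Two small computations from this setup are easy to check by hand or in code: the value
of the tolerance level at $n = 10{,}000$, and the FP and FN counts for a given cut-off.
The helper name `fp_fn` and the toy labels below are illustrative assumptions; FP counts
noise among the selected ranks and FN counts signals left beyond the cut-off, which is
how the tables in this section are read here.

```python
import math


def fp_fn(is_signal_by_rank, cutoff):
    """False positives and false negatives when the first `cutoff` ranks are selected.

    is_signal_by_rank[k] is True when the (k + 1)-th smallest p-value is a signal.
    """
    fp = sum(1 for s in is_signal_by_rank[:cutoff] if not s)
    fn = sum(1 for s in is_signal_by_rank[cutoff:] if s)
    return fp, fn


n = 10_000
alpha_n = 1.0 / (2.0 * math.log(n))
print(round(alpha_n, 4))  # 0.0543

labels = [True, True, False, True, False, False]
print(fp_fn(labels, 2))  # (0, 1)
print(fp_fn(labels, 4))  # (1, 0)
```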

Example 1 shows the effect of signal intensity. Set $\sigma = 1$ and $\pi = 0.01$. Signal
mean $\mu$ varies from 2.5 to 5.5. Since the signal proportion is very small, the results
of BH-FDR and the adaptive FDR are very close. To save space, the results of the
latter are omitted in this example. Figure 2 presents the histograms of $\hat d_{**}$ from the
100 replications for $\mu = 2.5$ and 5.5. It shows that as signal intensity increases, the
distribution of $\hat d_{**}$ becomes more concentrated.

Table 1 shows that the cut-off point of BH-FDR ($t_{FDR}$) is between $\hat d_*$, the estimate
of the separation between the signal and indistinguishable subsets (S-I Separation),
and $\hat d_{**}$, the estimate of the separation between the indistinguishable and noise
subsets (I-N Separation). As signal intensity increases, the indistinguishable subset
shrinks and the cut-off locations of all three procedures move closer. As to the accuracy
of identifying the signal, noise, and indistinguishable subsets, it is shown that the
FPs of $\hat d_*$ and $t_{FDR}$ are well controlled, with $t_{FDR}$ having a slightly higher FP
when signal intensity increases. This agrees with our intuition since BH-FDR applies a less
stringent rule to control false positives. The FP of $\hat d_{**}$, however, is not controlled,
as it is not supposed to be. Interesting results are shown for the FN of $\hat d_{**}$. Among
the 100 signals, the proportions of mis-specified signals of $\hat d_{**}$ are 28%, 11%, 3%,
and 1% for $\mu = 2.5, 3.5, 4.5, 5.5$, respectively. Compared with the FN of $t_{FDR}$, which
has mis-specified proportions of 92%, 48%, 12%, and 1%, $\hat d_{**}$ has many fewer false
negatives when signals are only moderately strong. This simulation shows that the proposed
estimators $\hat d_*$ and $\hat d_{**}$ adapt to the signal intensity, and the identified
indistinguishable subset between $\hat d_*$ and $\hat d_{**}$ shrinks with increasing $\mu$.

Figure 2: Histograms of $\hat d_{**}$ for $\mu = 2.5$ and 5.5 from 100 replications.

Table 1: Effect of signal intensity. Median and MAD (in parentheses) of $\hat d_*$, $t_{FDR}$,
$\hat d_{**}$, and their corresponding FP and FN over 100 replications. $\pi$ is fixed at 1%.

              S-I Separation              BH-FDR                      I-N Separation
              $\hat d_*$  FP      FN      $t_{FDR}$  FP      FN       $\hat d_{**}$  FP         FN
$\mu = 2.5$   3(1)        0(0)    97(1)   8(4)       0(0)    92(4)    325(269)       261(253)   28(19)
$\mu = 3.5$   17(3)       0(0)    83(3)   54(7)      2(1)    48(6)    194(113)       103(97)    11(9)
$\mu = 4.5$   54(4)       0(0)    46(5)   92(4)      4(3)    12(3)    126(44)        29(34)     3(3)
$\mu = 5.5$   86(3)       0(0)    14(3)   103(1)     4(1)    1(1)     104(9)         4(3)       1(1)

Example 2 demonstrates the effect of signal proportion. Set $\sigma = 1$ and $\mu = 3$. The
signal proportion $\pi$ changes from 1% to 20%. As shown in Table 2, when $\pi$ increases,
the FP of $\hat d_*$ remains around 0. The FN of $\hat d_{**}$ is also fairly robust over the
different numbers of signals. BH-FDR and adaptive FDR, on the other hand, increase in both
FP and FN with increasing signal proportion.

Table 2: Effect of signal proportion. $\mu$ is fixed at 3.

          S-I Separation                 BH-FDR                        adapFDR                       I-N Separation
$|S_1|$   $\hat d_*$  FP     FN          $t_{FDR}$  FP      FN         $t_{aFDR}$  FP      FN        $\hat d_{**}$  FP          FN
100       8(3)        0(0)   92(3)       27(7)      1(1)    74(6)      27(7)       1(1)    74(6)     227(140)       147(135)    21(12)
500       40(6)       0(0)   460(6)      255(14)    11(4)   255(13)    259(13)     12(4)   253(13)   1119(494)      645(462)    31(28)
1000      81(8.9)     0(0)   919(9)      637(21)    28(4)   392(18)    647(18)     31(4)   386(18)   1960(589)      996(548)    38(31)
2000      172(11)     0(0)   1828(10)    1485(29)   59(9)   575(28)    1543(32)    72(11)  529(24)   3224(660)      1268(614)   46(32)


Example 3 has heterogeneous noise generated for 10% of the observations. With
signal intensity and proportion fixed at $\mu = 3.5$ and $\pi = 1\%$, the proportion of
heterogeneous noise is 10 times the proportion of signals. This example demonstrates
a common scenario in real-data applications where unjustified artifacts cause
heterogeneity in the background noise. The heterogeneous noise in this example is
randomly generated from $N(0, \sigma)$ with $\sigma \sim$ Gamma$(2, \theta)$. Let the scale
parameter $\theta$ vary from 0.5 to 2, which results in increasing variability for the
noise. Due to the small signal proportion, the results of the adaptive FDR are very close
to those of the BH-FDR and are omitted in this example. Table 3 shows that the FPs of all
procedures increase with $\theta$. The FNs, on the other hand, are very stable. Theorem 3.3
provides some explanation for the robustness of $\hat d_{**}$ in controlling false negatives.
Since heterogeneous noise can result in large jumps, the estimated proportion $\hat\pi$ is
larger than the true $\pi$. Constructed using this $\hat\pi$, $\hat d_{**}$ is essentially the
$\tilde d_{**}$ in (19), which is built on an upper bound of the true $\pi$. The theoretical
property on false negative control is presented in (21).
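
One possible way to generate data of the kind described in Example 3 is sketched below
using only the Python standard library. Several details are assumptions made for
illustration: the function name `heterogeneous_sample`, the treatment of $\sigma$ as a
standard deviation, the reading of Gamma$(2, \theta)$ as shape 2 and scale $\theta$, and
the unit-variance normal law for the remaining (homogeneous) noise.

```python
import random


def heterogeneous_sample(n=10_000, pi=0.01, mu=3.5, hetero_frac=0.10, theta=1.0, seed=0):
    """Draw n observations: signals, heterogeneous noise, and standard normal noise."""
    rng = random.Random(seed)
    n_sig = round(pi * n)           # number of signals
    n_het = round(hetero_frac * n)  # number of heterogeneous-noise points
    x, is_signal = [], []
    for i in range(n):
        if i < n_sig:
            x.append(rng.gauss(mu, 1.0))
            is_signal.append(True)
        elif i < n_sig + n_het:
            sigma = rng.gammavariate(2.0, theta)  # shape 2, scale theta (assumed reading)
            x.append(rng.gauss(0.0, sigma))       # sigma used as a standard deviation
            is_signal.append(False)
        else:
            x.append(rng.gauss(0.0, 1.0))
            is_signal.append(False)
    return x, is_signal


x, is_signal = heterogeneous_sample(theta=2.0)
print(len(x), sum(is_signal))  # 10000 100
```

Larger values of `theta` inflate the spread of the heterogeneous block, which is the
mechanism behind the growing false positive counts discussed above.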



Table 3: Robustness for heterogeneous noise. Set $\mu = 3.6$ and $\pi = 1\%$.

                 S-I Separation               BH-FDR                        I-N Separation
                 $\hat d_*$  FP       FN      $t_{FDR}$  FP        FN       $\hat d_{**}$  FP         FN
$\theta = 0.5$   22(4)       5(3)     82(3)   69(9)      15(4)     45(6)    196(67)        107(60)    12(7)
$\theta = 1$     53(7)       35(6)    81(4)   132(12)    71(9)     38(4)    443(180)       347(174)   7(4)
$\theta = 1.5$   94(10)      75(9)    80(4)   195(15)    130(13)   35(4)    556(230)       459(223)   7(3)
$\theta = 2$     134(12)     113(10)  80(4)   249(12)    182(10)   33(4)    556(179)       466(175)   9(4)


Example 4 generates autocorrelated observations with $\rho_{ij} = a^{|i-j|}$ for
$a = 0, 0.5, 0.7$ and 0.9. The number of observations is reduced to 1,000 to save
computation time. Set $\sigma = 1$, $\pi = 0.05$, $\mu = 3$, and
$\alpha_n = \beta_n = 1/\log n$. The results summarized in Table 4 are quite stable over
different values of the autocorrelation parameter $a$, with $\hat d_{**}$ having slightly
better control on false negatives for large $a$.

Table 4: Robustness under autocorrelation. Set $\pi = 0.05$ and $\mu = 3$.

            S-I Separation              BH-FDR                     I-N Separation
            $\hat d_*$  FP     FN       $t_{FDR}$  FP     FN       $\hat d_{**}$  FP       FN
$a = 0$     14(3)       0(0)   36(3)    27(4)      1(1)   25(4)    74(45)         30(33)   7(7)
$a = 0.5$   13(4)       0(0)   37(4)    24(7)      1(1)   27(7)    69(35)         27(28)   7(7)
$a = 0.7$   13(7)       0(0)   37(6)    28(9)      1(1)   24(7)    67(44)         25(33)   5(8)
$a = 0.9$   14(13)      0(0)   36(13)   29(17)     0(0)   20(15)   72(51)         27(41)   3(4)


5 Real Application 



We apply the three-subset identification to the genotyping data from the Autism
Genetics Resource Exchange (AGRE) collection (Bucan et al. (2009)) generated by
high-throughput single nucleotide polymorphism (SNP) array technology. Genotypes
in this data set are measured in Log R ratio (LRR), which is calculated at each SNP
location as $\log_2(R_{obs}/R_{exp})$, where $R_{obs}$ is the observed total intensity of both
major and minor alleles and $R_{exp}$ is computed from a reference genome (Peiffer et
al. (2006)). LRR data are widely used for detecting copy number variants (CNVs), in which
the goal is to identify genomic regions with deletion or duplication of DNA segments
(Feuk et al. (2006)). Such DNA mutations have been reported to play important roles in
population diversity and disease association (McCarroll and Altshuler (2007)). Due to
the fact that the intensity ratio deviates from the baseline in CNV segments, various
segment detection methods have been developed to detect CNVs based on SNP array
data (Olshen et al. (2004), Zhang et al. (2010), Jeng et al. (2010), Siegmund et
al. (2011), Jeng et al., etc.).

In this paper, instead of just providing a list of candidates for CNVs, we provide
more insight into the data by identifying the signal, noise, and indistinguishable
subsets. We specifically consider the observations on Chromosome 19 for three
individuals, which are collected from 9501 SNPs for each individual. The signals are
copy number deletions, which may cause LRR to be negative. For a given individual,
LRR observations are first normalized, and then the likelihood ratio is calculated for
each interval with length $\le L$ as in Jeng et al. (2010). The likelihood ratio of an
interval is defined as the standardized sum of observations in that interval, and $L$ is
set at 20 as most of the CNVs cover fewer than 20 SNPs (Zhang et al., 2009). There are
$n = 9501 \times 20 = 190,020$ such likelihood ratio statistics for each individual. When
the distribution of LRR changes in an interval, the corresponding likelihood ratio
is expected to deviate from the baseline. Figure 3 shows the distribution of
the likelihood ratios for all the intervals with length $\le L$ on Chromosome 19 of one
individual. The outliers in the left tail are likely to come from copy number deletions.
The plots are similar for other individuals and are, thus, omitted.
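
To make the interval statistic concrete, here is a minimal sketch that computes the standardized sum for every interval of length at most $L$, assuming the normalized LRR values are held in a plain list. The function name `interval_statistics` and the (start, length, statistic) layout are hypothetical choices for this illustration, not the implementation of Jeng et al. (2010).

```python
import math


def interval_statistics(x, max_len):
    """Standardized sums for all intervals of length 1..max_len.

    Returns a list of (start, length, statistic) with
    statistic = sum(x[start:start + length]) / sqrt(length),
    which is N(0, 1) when the entries are independent N(0, 1) noise.
    """
    prefix = [0.0]
    for v in x:
        prefix.append(prefix[-1] + v)  # prefix[k] = x[0] + ... + x[k - 1]
    stats = []
    for start in range(len(x)):
        for length in range(1, max_len + 1):
            end = start + length
            if end > len(x):
                break  # interval would run past the end of the sequence
            total = prefix[end] - prefix[start]
            stats.append((start, length, total / math.sqrt(length)))
    return stats
```

Because intervals are truncated at the right end of the sequence, this enumeration returns slightly fewer than (number of SNPs) $\times L$ statistics; strongly negative values correspond to candidate deletions.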



Figure 3: Histogram of the likelihood ratios of the intervals on Chromosome 19 (horizontal axis: likelihood ratio).

We calculate the p-values for these likelihood ratios assuming that the background
noise follows $N(0, 1)$ after normalization. The likelihood ratios are locally dependent
because the intervals are short and overlapping. In this example we treat
them as independent observations to illustrate the method. The separations among
signal, indistinguishable, and noise subsets are determined by either $\hat{d}_*$ (9) and $\hat{d}_{**}$
(10) or their counterparts in (18) and (19). We find that estimating the signal proportion by (17)
seems to result in a much larger proportion estimate than commonly expected for
SNP array data, possibly due to the artifacts involved in the data generation process
(Marioni et al., 2007). Thus, we use a more reasonable bound of $0 < \pi < 0.005$ for this
data set. Setting the upper bound at 0.005 means that the copy number deletions on
Chromosome 19 are approximately fewer than 50 (Zhang et al., 2009). The signal, noise,
and indistinguishable subsets are identified by deriving the cut-off points $\hat{d}_*$ and $\hat{d}_{**}$.
Because the intervals are overlapping, we only keep intervals having minimum p-values
among overlapping segments to indicate the locations of copy number deletions. All
the other intervals overlapping with them are removed. $\hat{d}_*$ and $\hat{d}_{**}$ are then re-defined
as the ranks among these non-overlapping intervals. For the three individuals, ($\hat{d}_*$,
$\hat{d}_{**}$) are (2, 18), (1, 76), and (1, 36), respectively.
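
The p-value computation and the removal of overlapping intervals described above can be sketched as follows. This is one plausible reading of the procedure, not the authors' code: it assumes one-sided (left-tail) p-values, since deletions push the statistic negative, and a greedy pass that visits intervals by increasing p-value. The names `left_tail_p_value` and `keep_non_overlapping` are hypothetical.

```python
import math


def left_tail_p_value(stat):
    """P(Z <= stat) for Z ~ N(0, 1); small when stat is very negative."""
    return 0.5 * math.erfc(-stat / math.sqrt(2.0))


def keep_non_overlapping(intervals):
    """Greedy selection: visit intervals by increasing p-value and keep one
    only if it does not overlap an interval that was already kept.

    Each interval is (start, end, p_value) with end exclusive.
    """
    kept = []
    for start, end, p in sorted(intervals, key=lambda iv: iv[2]):
        if all(end <= k_start or k_end <= start for k_start, k_end, _ in kept):
            kept.append((start, end, p))
    return kept
```

The position of an interval in the returned list (counting from 1) is its rank among the non-overlapping intervals, which is the scale on which the re-defined cut-off points are expressed.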

We further perform validation on the identified signal, noise, and indistinguishable
subsets. The candidates in each subset are compared to the reported members in a
CNV database maintained at The Centre for Applied Genomics
(http://projects.tcag.ca/variation/project.html). A candidate region can overlap with
zero, one, or more than one CNV in the database. The mean value of the number of
such CNVs in the database is presented for each subset in Table 5. In other words, let
$O_j$ = number of CNVs in the database that overlap with the $j$-th candidate in the list
of ranked intervals. Define ovlap-s = mean($O_j$, $1 \le j \le \hat{d}_*$), ovlap-i = mean($O_j$, $\hat{d}_* < j \le \hat{d}_{**}$),
and ovlap-n = mean($O_j$, $\hat{d}_{**} < j \le$ total number of intervals). Table 5 shows
that these mean values, in general, decrease from ovlap-s to ovlap-n. For example,
for individual 3, 6.8 CNVs in the database overlap with each candidate in the
identified indistinguishable subset on average,
while the number decreases to 2.0 for the identified noise subset. This agrees with
our intuition for the three subsets, as larger mean values represent stronger evidence
for identifying the true CNVs. One exception is ovlap-s for individual 3. There is
only one candidate in the identified signal subset, which happens to be missed in the
database. A possible explanation is that this candidate is a de novo CNV only carried
by individual 3. The sample correlations between the interval length and $O_j$ are
0.17, 0.28, and 0.26 for the three individuals, respectively, indicating that the trend
observed in Table 5 is not likely caused by the length factor.
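
The three overlap summaries are simple slice means over the ranked list. The sketch below shows the computation under the assumption that the overlap counts are stored in rank order; the function name `subset_overlap_means` and the toy numbers in the usage note are hypothetical and are not taken from the AGRE data.

```python
def subset_overlap_means(overlaps, d_star, d_star_star):
    """Mean of O_j over the signal, indistinguishable, and noise subsets.

    overlaps[j - 1] is O_j for the j-th ranked interval (ranks are 1-based).
    An empty subset yields None instead of a mean.
    """
    def mean(values):
        return sum(values) / len(values) if values else None

    return {
        "ovlap-s": mean(overlaps[:d_star]),             # 1 <= j <= d_star
        "ovlap-i": mean(overlaps[d_star:d_star_star]),  # d_star < j <= d_star_star
        "ovlap-n": mean(overlaps[d_star_star:]),        # j > d_star_star
    }
```

For instance, with toy counts `[12, 9, 4, 3, 2, 1]` and cut-offs 2 and 4, the function returns means of 10.5, 3.5, and 1.5 for the signal, indistinguishable, and noise subsets.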

Table 5: Estimated separations, $\hat{d}_*$ and $\hat{d}_{**}$, and the mean value of $O_j$ in each subset.

               ovlap-s   $\hat{d}_*$   ovlap-i   $\hat{d}_{**}$   ovlap-n
Individual 1   10.5      2             3.4       18               2.5
Individual 2   4         1             4.7       76               2.1
Individual 3   0         1             6.8       36               2.0



6 Further Discussion 

In this paper, we developed a novel statistical framework and an efficient TLT
procedure to categorize the data into the signal, noise, and indistinguishable subsets. This
unique categorization can provide further insight into the data and help
practitioners design more appropriate follow-up studies to identify the true signals in
different subsets. Another motivation for the new development is its potential to
provide an objective criterion for sample-size determination based on the cardinality
of the indistinguishable subset. Unlike traditional sample-size calculation, which
is based on a pre-specified level of signal intensity, we may determine whether the
sample size is large enough by examining the size of the indistinguishable subset.

Additional insight for the quality of the data may also be achieved by examining 
the indistinguishable subset. For example, a large indistinguishable subset suggests 
that there are many small non-null observations, which are either true signals or, 
very often, caused by artifacts involved during the data generation. Investigating 
the sources of possible artifacts in follow-up studies may significantly reduce the 
indistinguishable subset and result in better separation between signals and noise. 

We developed two related TLT schemes: one is completely data-driven, and the other
utilizes prior knowledge on the possible range of the signal proportion. Such flexibility
allows practitioners to meet the needs of various applications. The computation for
both procedures is very fast.

The study in this paper is based on p-values. Other statistics carrying information
about signal intensity, such as the local FDR values (Efron (2007), Sun and Cai
(2007)), may be used in place of p-values. It will be interesting to investigate this
possibility in future research.

In this paper, we assumed independent p-values to allow a succinct theoretical
study of the new method. Simulation examples in Section 4 demonstrate the robustness
of the proposed method for autocorrelated observations. We plan to study in
depth the three-subset categorization under dependence in future work. We find the
recent paper by Fan et al. (2012) to be very helpful. According to their work, it is
possible to estimate the arbitrary dependence structure of the p-values and transform
the dependent p-values into weakly dependent ones.

Last but not least, estimating the separation point between the indistinguish- 
able and noise subsets can be related to the problem of variable screening in high- 
dimensional regression and can provide new insights on the well-known challenge of 
screening parameter selection in high-dimensional data analysis. 
Appendix: Proofs

The proofs for theorems in Sections 2 and 3 are provided. A preliminary lemma is first
introduced to summarize part of the results in Theorems 1 and 2 in Meinshausen and
Rice (2006). The proof of the lemma is omitted.

Lemma 6.1 Assume the same conditions as in Theorem 3.1. Let $\hat{\pi}$ be defined as in
(17). Then for any given $\epsilon > 0$,

$$P((1 - \epsilon)\pi < \hat{\pi} < \pi) \to 1$$

as $n \to \infty$.



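Proof of Theorem 2.1

It is sufficient to show the following claims. For any $\epsilon > 0$,

$$
\begin{aligned}
P(\nexists \text{ signal subset}) &= o(1) \quad \text{given } \mu \ge \sqrt{2(1+\epsilon)\log|S_0|} - \sqrt{2\log|S_1|}, && (22) \\
P(\exists \text{ signal subset}) &= o(1) \quad \text{given } \mu \le \sqrt{2(1-\epsilon)\log|S_0|} - \sqrt{2\log|S_1|}, && (23) \\
P(\nexists \text{ indist. subset}) &= o(1) \quad \text{given } \mu \le \sqrt{2(1-\epsilon)\log|S_0|} + \sqrt{2\log|S_1|}, && (24) \\
P(\exists \text{ indist. subset}) &= o(1) \quad \text{given } \mu \ge \sqrt{2(1+\epsilon)\log|S_0|} + \sqrt{2\log|S_1|}, && (25) \\
P(\nexists \text{ noise subset}) &= o(1) \quad \text{given } \log|S_1| \le (1-\epsilon)\log|S_0|, && (26) \\
P(\exists \text{ noise subset}) &= o(1) \quad \text{given } \log|S_1| \ge (1+\epsilon)\log|S_0|. && (27)
\end{aligned}
$$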

Consider (22) first.

$$
\begin{aligned}
P(\nexists \text{ signal subset}) &= P(\max\{X_i, i \in S_1\} < \max\{X_i, i \in S_0\}) \\
&\le P(\max\{X_i, i \in S_1\} < \sqrt{2\log|S_0|}) + P(\max\{X_i, i \in S_0\} > \sqrt{2\log|S_0|}) \\
&\le P(\max\{X_i, i \in S_1\} < \sqrt{2\log|S_0|}) + o(1), \qquad (28)
\end{aligned}
$$

where the last inequality is by the extreme value theory of standard normal random
variables. Also,

$$
\begin{aligned}
& P(\max\{X_i, i \in S_1\} < \sqrt{2\log|S_0|}) \\
&= P(\max\{X_i, i \in S_1\} - \mu < \sqrt{2\log|S_0|} - \mu) \\
&= P(\max\{X_i, i \in S_1\} - \mu < \sqrt{2\log|S_1|} + (\sqrt{2\log|S_0|} - \mu - \sqrt{2\log|S_1|})) \\
&\le P(\max\{X_i, i \in S_1\} - \mu < \sqrt{2\log|S_1|} + (\sqrt{2\log|S_0|} - \sqrt{2(1+\epsilon)\log|S_0|})) \\
&= o(1), \qquad (29)
\end{aligned}
$$

where the inequality is by $\mu \ge \sqrt{2(1+\epsilon)\log|S_0|} - \sqrt{2\log|S_1|}$. Combining (28) and
(29) gives (22).

Next consider (23).

$$
\begin{aligned}
& P(\exists \text{ signal subset}) \\
&= P(\max\{X_i, i \in S_1\} > \max\{X_i, i \in S_0\}) \\
&\le P(\max\{X_i, i \in S_1\} > \sqrt{2\log|S_0|} - \log\log n) + P(\max\{X_i, i \in S_0\} < \sqrt{2\log|S_0|} - \log\log n) \\
&\le P(\max\{X_i, i \in S_1\} > \sqrt{2\log|S_0|} - \log\log n) + o(1), \qquad (30)
\end{aligned}
$$

where the last inequality is by the extreme value theory of standard normal random
variables. Also,

$$
\begin{aligned}
& P(\max\{X_i, i \in S_1\} > \sqrt{2\log|S_0|} - \log\log n) \\
&= P(\max\{X_i, i \in S_1\} - \mu > \sqrt{2\log|S_1|} + (\sqrt{2\log|S_0|} - \log\log n - \mu - \sqrt{2\log|S_1|})) \\
&\le P(\max\{X_i, i \in S_1\} - \mu > \sqrt{2\log|S_1|} + (\sqrt{2\log|S_0|} - \log\log n - \sqrt{2(1-\epsilon)\log|S_0|})) \\
&= o(1), \qquad (31)
\end{aligned}
$$

where the inequality is by $\mu \le \sqrt{2(1-\epsilon)\log|S_0|} - \sqrt{2\log|S_1|}$. Combining (30) and
(31) gives (23).

The claims in (24)-(27) can be proved in similar ways.

□
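
The extreme value steps in (28) and (30) rest on the maximum of $m$ independent standard normal variables concentrating just below $\sqrt{2\log m}$. The following Monte Carlo sketch, with hypothetical function names and an arbitrary seed, estimates how often the maximum falls above $\sqrt{2\log n}$ or below $\sqrt{2\log n} - \log\log n$. It is an illustration only, not part of the proof; a standard tail approximation suggests the upper exceedance probability is on the order of $1/\sqrt{4\pi\log n}$, so it vanishes only slowly.

```python
import math
import random


def max_of_normals(n, rng):
    """Maximum of n independent N(0, 1) draws."""
    return max(rng.gauss(0.0, 1.0) for _ in range(n))


def tail_frequencies(n=10000, reps=50, seed=0):
    """Fraction of replications with the maximum above sqrt(2 log n) and
    the fraction with the maximum below sqrt(2 log n) - log log n."""
    rng = random.Random(seed)
    upper = math.sqrt(2.0 * math.log(n))
    lower = upper - math.log(math.log(n))
    maxima = [max_of_normals(n, rng) for _ in range(reps)]
    above = sum(m > upper for m in maxima) / reps
    below = sum(m < lower for m in maxima) / reps
    return above, below
```

With the default settings, the lower threshold is essentially never undershot, while the upper threshold is exceeded in a small fraction of replications.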

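Proof of Theorem 3.1

Consider (14) first. We have

$$
\begin{aligned}
P(n_0(\hat{d}_*) > 0) &\le P\left(\exists\, i \in S_0 : P_i < \frac{\alpha_n}{(1-\hat{\pi})n}\right) \\
&\le (1-\pi)n \cdot P\left(P_i < \frac{\alpha_n}{(1-\pi)n},\ \hat{\pi} < \pi\right) + P(\hat{\pi} \ge \pi) \\
&\le \alpha_n + o(1),
\end{aligned}
$$

where the third inequality is by Lemma 6.1 and the fact that p-values from noise are
uniformly distributed. Then (14) follows.

Next, consider (15). Define $n_1 = |S_1|$. Recall that $j = \hat{d}_{**} - \hat{\pi} n$; then

$$
\begin{aligned}
P(n_1(\hat{d}_{**}) < (1-\epsilon)n_1) &= P(n_0(\hat{d}_{**}) > \hat{d}_{**} - (1-\epsilon)n_1) \\
&= P(n_0(\hat{d}_{**}) > \hat{\pi} n + j - (1-\epsilon)\pi n) \\
&= P(n_0(\hat{d}_{**}) > (\hat{\pi} - (1-\epsilon)\pi)n + j) \\
&\le P(n_0(\hat{d}_{**}) > (\hat{\pi} - (1-\epsilon)\pi)n + j,\ \hat{\pi} > (1-\epsilon)\pi) + P(\hat{\pi} \le (1-\epsilon)\pi) \\
&\le P(n_0(\hat{d}_{**}) > j) + o(1), \qquad (32)
\end{aligned}
$$

where the first equality is by $\hat{d}_{**} = n_0(\hat{d}_{**}) + n_1(\hat{d}_{**})$, the second equality is by $n_1 = \pi n$,
and the last step is by Lemma 6.1.

In the case of $\hat{\pi} n \le \hat{d}_*$, we have $\hat{d}_{**} = \hat{d}_* = \hat{\pi} n$ and $j = 0$. Then

$$P(n_0(\hat{d}_{**}) > j) = P(n_0(\hat{d}_*) > 0) \to 0 \qquad (33)$$

by (14). (15) follows by combining (32) and (33).

In the case of $\hat{\pi} n > \hat{d}_*$, define $P^0_{(j)}$ as the $j$-th smallest p-value from the $n_0$ noise observations. Then

$$
\begin{aligned}
P(n_0(\hat{d}_{**}) > j) &\le P(P^0_{(j)} < p_{(\hat{d}_{**})}) \\
&\le P(\mathrm{Beta}(j, n_0 - j + 1) < F^{-1}_{(j)}(\beta_n)) \\
&\le P(\mathrm{Beta}(j, (1-\hat{\pi})n - j + 1) < F^{-1}_{(j)}(\beta_n),\ \hat{\pi} < \pi) + P(\hat{\pi} \ge \pi) \\
&= \beta_n + o(1), \qquad (34)
\end{aligned}
$$

where the first inequality is because when the elements from $S_0$ are more than $j$ in
$\{1, \ldots, \hat{d}_{**}\}$, the $j$-th smallest p-value from $S_0$ must rank before the p-value at $\hat{d}_{**}$.
The second inequality is by the well-known fact that $P^0_{(j)} \sim \mathrm{Beta}(j, n_0 - j + 1)$ and
the construction of $\hat{d}_{**}$, where $p_{(\hat{d}_{**})} < F^{-1}_{(j)}(\beta_n)$. The last step is by the definition of
$F^{-1}_{(j)}$ and Lemma 6.1. Combining (32) and (34) gives (15).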
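
The order-statistic fact used in (34), that the $j$-th smallest of $n_0$ independent uniform p-values follows a Beta($j$, $n_0 - j + 1$) distribution, can be checked numerically. The sketch below compares the simulated mean of the $j$-th smallest null p-value with the Beta mean $j/(n_0 + 1)$; the function names, sample sizes, and seed are arbitrary choices for illustration.

```python
import random


def jth_smallest_uniform(n0, j, rng):
    """j-th smallest of n0 independent Uniform(0, 1) p-values (j is 1-based)."""
    return sorted(rng.random() for _ in range(n0))[j - 1]


def simulated_mean(n0=50, j=5, reps=4000, seed=0):
    """Monte Carlo mean of the j-th smallest null p-value."""
    rng = random.Random(seed)
    return sum(jth_smallest_uniform(n0, j, rng) for _ in range(reps)) / reps


def beta_mean(j, n0):
    """Mean of Beta(j, n0 - j + 1), namely j / (n0 + 1)."""
    return j / (n0 + 1)
```

The two means should agree up to Monte Carlo error, for example `simulated_mean(50, 5)` against `beta_mean(5, 50)`.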

Proof of Theorem 3.2

Defines events A = {d^, > vrn}, B = {d^ < nn}, and C = {fm = nn}. By the 
construction of d^ in it is enough to show that 

$$P(A \cap B \cap C) \to 1,$$

which is implied by

$$P(A^c) + P(B^c) + P(C^c) \to 0. \qquad (35)$$

Consider P{A'^) first. 

PiA") < P{d^ < Tin) + P(7r > tt) 

< P{3ieS,:P,> . \. ) + o(l) 

(1 — Ti)n 

^ -5(^)+o(l) 

< 7™G(n^") + o(l) = o(l), 

where the second inequahty is by the construction of d^, in ([H]) and Lemma 16.11 the 
fourth inequahty is by ^^"^^^ > ra"^ when a„, ^ ra"'^ and r > 1, and the last step is 
by the condition 7rnG{n~^) — t- 0. 

For $P(B^c)$, it is easy to show that $P(B^c) = P(n_0(\hat{d}_*) > 0) \to 0$ by similar
arguments leading to (14).

Now consider $P(C^c)$. By Lemma 6.1, it is enough to show that

$$P(\hat{\pi} n < \pi n - 1) \to 0,$$

which is implied by

$$P\left(\frac{\hat{\pi}}{\pi} - 1 < -\frac{1}{\pi n}\right) \to 0. \qquad (36)$$
TT Tin 



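Define

$$F_n(t) = \frac{1}{n}\sum_{i=1}^{n} 1(P_i \le t), \quad U_{n_0}(t) = \frac{1}{n_0}\sum_{i=1}^{n_0} 1(P_i^0 \le t), \quad G_{n_1}(t) = \frac{1}{n_1}\sum_{i=1}^{n_1} 1(P_i^1 \le t).$$

Then, by the construction of $\hat{\pi}$ in (17), for any $t \in [0, 1]$,

$$
\begin{aligned}
\frac{\hat{\pi}}{\pi} - 1 &\ge \frac{F_n(t) - t - \pi}{\pi} - \frac{\sqrt{2\log\log n}\,\sqrt{t(1-t)}}{\pi\sqrt{n}} \\
&= \frac{(1-\pi)U_{n_0}(t) + \pi G_{n_1}(t) - t - \pi}{\pi} - \frac{\sqrt{2\log\log n}\,\sqrt{t(1-t)}}{\pi\sqrt{n}} \\
&= (G(t) - 1) + (G_{n_1}(t) - G(t)) + \frac{1-\pi}{\pi}(U_{n_0}(t) - t) - t - \frac{\sqrt{2\log\log n}\,\sqrt{t(1-t)}}{\pi\sqrt{n}}.
\end{aligned}
$$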
Let $t = n^{-r}$. Then by the condition $\pi n \bar{G}(n^{-r}) \to 0$ and $r > 1$,

\G{t)-l)\=G{n-^) = oi^), 



\Un,{t) -t\= Op ^ = Op^^ — = Op — 



v/21oglognv/t(l-t) ^ v/21oglogn 1 ^ 1 , 



Therefore, (36) follows. Combining the above results for $P(A^c)$, $P(B^c)$, and $P(C^c)$
gives (35).

Proof of Theorem [3:2] 

The proof of this theorem is similar to that of Theorem 3.1 and is, thus, omitted
to save space.